This document includes a more complete example of conducting an item response theory (IRT) analysis in R. This work-up includes not only the estimation of item and person parameters, but also item characteristic curves, test information, reliability, and model fit. Most data cleaning and manipulation will use a suite of packages known as the tidyverse (Wickham et al., 2019). The IRT analysis will use the mirt package (Chalmers, 2012).

Data Cleaning

The first step of any analysis is to read in the data. For this example, we will use a balanced sample of males and females from a large-scale operational assessment. This data set contains responses from 5,000 students to 40 items assessing their knowledge and understanding of engineering design. Also included is a file of metadata describing the 40 items.

library(tidyverse)  # readr, dplyr, tidyr, purrr, ggplot2, etc.
library(readxl)     # for reading the Excel metadata
library(here)       # for project-relative file paths

ied <- read_csv(here("data", "IED_data.csv"),
                col_types = cols(.default = col_integer(),
                                 gender = col_character()))
meta <- read_excel(here("data", "metadata.xlsx"), sheet = "IED")

We next have to determine how each item will be modeled. To do this, we create a new variable called mirt_type. When an item is dichotomously scored (i.e., NIS == 2), we will use the 3-parameter logistic model (3PL; Birnbaum, 1968). For polytomously scored items, we will use the graded response model (GRM; Samejima, 1969, 1972, 1997). The modeling itself will happen with the mirt package (Chalmers, 2019). For a list of available models, see ?mirt.
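For reference, the 3PL gives the probability of a correct response to item \(j\) from respondent \(i\) as

\[
P(x_{ij} = 1 \mid \theta_i) = c_j + (1 - c_j)\,\frac{1}{1 + e^{-a_j(\theta_i - b_j)}},
\]

where \(a_j\) is the discrimination, \(b_j\) the difficulty, and \(c_j\) the lower asymptote (pseudo-guessing) parameter. The GRM instead models the cumulative probability of scoring in category \(k\) or higher,

\[
P(x_{ij} \geq k \mid \theta_i) = \frac{1}{1 + e^{-a_j(\theta_i - b_{jk})}},
\]

with the probability of scoring exactly \(k\) given by the difference between adjacent cumulative probabilities.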

We also need to clean the response data. In the ied data, missing values have been coded as -9. The na_if() function can be used to replace a given value with R’s internal representation of missing values, NA.

clean_meta <- meta %>%
  select(itemid, itemtype, NIS, maxscore) %>%
  mutate(mirt_type = case_when(NIS == 2 ~ "3PL",
                               NIS > 2 ~ "graded"))

clean_ied <- ied %>%
  mutate_all(~na_if(.x, -9))

Estimate IRT Model

To estimate the IRT model, we’ll use the mirt() function from the mirt package (Chalmers, 2019). This function requires that the data include only item responses, so we’ll create a data set, model_data, that is the same as the original data but with the studentid and gender columns removed.

model_data <- clean_ied %>%
  select(-studentid, -gender)

We are now ready to estimate the model. The first argument is the data, which we specify as the model_data we just created. Next, the actual model must be specified. Because we are using a unidimensional model, we have only one factor, called F1. This factor is measured by items 1 through the number of columns in model_data. We use the glue() function to dynamically determine the number of items; in this example, glue("F1 = 1-{ncol(model_data)}") will evaluate to "F1 = 1-40". Then, for each item, we specify its type. We calculated this when we created the clean_meta data, so we can pull that variable out. Note that this assumes that the metadata is in the same order as the columns of model_data. Finally, we’ll set a random seed to make sure we all get the same results (they should be pretty close without this).

library(mirt)  # IRT model estimation
library(glue)  # string interpolation for the model syntax

model_3pl <- mirt(data = model_data, model = glue("F1 = 1-{ncol(model_data)}"),
                  itemtype = clean_meta$mirt_type,
                  technical = list(set.seed = 9416))

Now we’ve estimated the model!

model_3pl
## 
## Call:
## mirt(data = model_data, model = glue("F1 = 1-{ncol(model_data)}"), 
##     itemtype = clean_meta$mirt_type, technical = list(set.seed = 9416))
## 
## Full-information item factor analysis with 1 factor(s).
## Converged within 0.0001 tolerance after 34 EM iterations.
## mirt version: 1.31 
## M-step optimizer: BFGS 
## EM acceleration: Ramsay 
## Number of rectangular quadrature: 61
## Latent density type: Gaussian 
## 
## Log-likelihood = -128483.3
## Estimated parameters: 154 
## AIC = 257274.7; AICc = 257284.5
## BIC = 258278.3; SABIC = 257788.9

IRT Parameters

The default output is not incredibly useful. What we ultimately want are the estimated item and person parameters. We’ll focus first on item parameters.

Item Parameters

We can extract the item parameters by using the coef() function. For each item, we see the slope (a1) and intercepts (d, d1, d2, …).

coef(model_3pl)
## $`2487`
##        a1      d     g u
## par 0.023 -1.643 0.119 1
## 
## $`2488`
##        a1      d g u
## par 1.304 -2.388 0 1
## 
## $`2489`
##        a1      d     g u
## par 2.407 -1.665 0.109 1
## 
## $`2490`
##        a1    d1    d2     d3
## par 0.906 3.497 1.208 -0.824
## 
## $`2491`
##        a1      d     g u
## par 0.503 -0.038 0.015 1
## 
## $`2788`
##       a1    d1    d2     d3     d4     d5     d6
## par 0.24 4.254 2.207 -1.129 -1.484 -1.746 -2.085
## 
## $`2790`
##        a1    d1   d2
## par 1.523 3.231 1.16
## 
## $`2792`
##        a1     d     g u
## par 0.006 -1.82 0.106 1
## 
## $`2794`
##        a1      d     g u
## par 0.397 -2.835 0.001 1
## 
## $`2797`
##        a1     d     g u
## par 0.899 0.434 0.003 1
## 
## $`2829`
##        a1      d     g u
## par 0.463 -5.335 0.015 1
## 
## $`2830`
##        a1     d     g u
## par 2.308 1.378 0.094 1
## 
## $`2831`
##        a1      d     g u
## par 0.904 -1.056 0.001 1
## 
## $`2832`
##        a1      d     g u
## par 1.919 -0.408 0.306 1
## 
## $`2841`
##        a1      d     g u
## par 1.503 -1.448 0.178 1
## 
## $`2843`
##       a1      d    g u
## par 1.56 -3.258 0.26 1
## 
## $`2844`
##        a1    d1  d2    d3    d4    d5
## par 1.296 4.545 2.9 1.894 0.409 0.408
## 
## $`2846`
##        a1     d     g u
## par 0.947 0.746 0.009 1
## 
## $`2847`
##        a1      d     g u
## par 0.742 -3.753 0.213 1
## 
## $`2851`
##        a1    d1    d2   d3    d4    d5    d6     d7     d8     d9   d10    d11
## par 1.232 3.878 2.496 1.72 1.155 0.689 0.294 -0.063 -0.367 -0.703 -1.16 -1.255
##        d12    d13
## par -1.832 -1.835
## 
## $`2853`
##        a1    d1    d2
## par 0.972 1.992 0.157
## 
## $`2880`
##        a1      d     g u
## par 2.099 -0.789 0.046 1
## 
## $`2881`
##        a1      d     g u
## par 1.776 -4.869 0.001 1
## 
## $`2882`
##        a1      d     g u
## par 1.861 -0.004 0.001 1
## 
## $`2883`
##        a1      d     g u
## par 1.795 -4.071 0.176 1
## 
## $`2884`
##        a1      d     g u
## par 0.307 -0.532 0.019 1
## 
## $`3072`
##        a1     d     g u
## par 0.176 0.499 0.059 1
## 
## $`3075`
##        a1     d     g u
## par 0.824 1.636 0.025 1
## 
## $`3076`
##        a1      d     g u
## par 0.882 -2.193 0.001 1
## 
## $`3102`
##        a1    d1    d2    d3    d4    d5    d6    d7    d8     d9
## par 2.473 5.039 3.275 2.486 2.049 1.711 1.294 0.864 0.358 -0.497
## 
## $`3107`
##        a1      d     g u
## par 2.954 -4.982 0.043 1
## 
## $`3158`
##        a1      d     g u
## par 1.194 -1.246 0.296 1
## 
## $`3456`
##        a1      d     g u
## par 1.384 -2.066 0.207 1
## 
## $`3482`
##       a1    d1    d2
## par 1.34 3.134 2.384
## 
## $`3484`
##       a1    d1    d2    d3     d4
## par 0.86 2.037 1.096 0.449 -0.212
## 
## $`3485`
##        a1     d     g u
## par 2.753 0.661 0.001 1
## 
## $`3486`
##        a1     d     g u
## par 2.807 0.807 0.001 1
## 
## $`3552`
##        a1    d1    d2
## par 1.785 2.178 1.072
## 
## $`3591`
##        a1      d     g u
## par 0.333 -1.874 0.003 1
## 
## $`5557`
##        a1    d1    d2    d3    d4    d5    d6    d7     d8
## par 2.579 5.048 3.558 2.547 2.044 1.455 0.446 0.001 -0.575
## 
## $GroupPars
##     MEAN_1 COV_11
## par      0      1

Often, it is more useful to think about the parameters using the better-known \(a\), \(b\), and \(c\) parameters. These can be retrieved by setting IRTpars = TRUE.
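Under the hood, this is just a change of parameterization: the slope/intercept form uses \(a_1\theta + d\) in the logistic, whereas the traditional form uses \(a(\theta - b)\), so for dichotomous items

\[
a = a_1, \qquad b = -\frac{d}{a_1},
\]

and for graded items each intercept converts as \(b_k = -d_k/a_1\). As a quick check, item 2489 above has \(a_1 = 2.407\) and \(d = -1.665\), giving \(b = 1.665/2.407 \approx 0.692\).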

coef(model_3pl, IRTpars = TRUE)
## $`2487`
##         a      b     g u
## par 0.023 72.255 0.119 1
## 
## $`2488`
##         a     b g u
## par 1.304 1.832 0 1
## 
## $`2489`
##         a     b     g u
## par 2.407 0.692 0.109 1
## 
## $`2490`
##         a     b1     b2    b3
## par 0.906 -3.858 -1.333 0.909
## 
## $`2491`
##         a     b     g u
## par 0.503 0.077 0.015 1
## 
## $`2788`
##        a      b1     b2    b3    b4    b5    b6
## par 0.24 -17.717 -9.192 4.703 6.182 7.273 8.685
## 
## $`2790`
##         a     b1     b2
## par 1.523 -2.122 -0.762
## 
## $`2792`
##         a       b     g u
## par 0.006 314.198 0.106 1
## 
## $`2794`
##         a     b     g u
## par 0.397 7.135 0.001 1
## 
## $`2797`
##         a      b     g u
## par 0.899 -0.483 0.003 1
## 
## $`2829`
##         a      b     g u
## par 0.463 11.517 0.015 1
## 
## $`2830`
##         a      b     g u
## par 2.308 -0.597 0.094 1
## 
## $`2831`
##         a     b     g u
## par 0.904 1.169 0.001 1
## 
## $`2832`
##         a     b     g u
## par 1.919 0.213 0.306 1
## 
## $`2841`
##         a     b     g u
## par 1.503 0.964 0.178 1
## 
## $`2843`
##        a     b    g u
## par 1.56 2.088 0.26 1
## 
## $`2844`
##         a     b1     b2     b3     b4     b5
## par 1.296 -3.508 -2.238 -1.461 -0.316 -0.315
## 
## $`2846`
##         a      b     g u
## par 0.947 -0.788 0.009 1
## 
## $`2847`
##         a     b     g u
## par 0.742 5.056 0.213 1
## 
## $`2851`
##         a     b1     b2     b3     b4    b5     b6    b7    b8    b9   b10
## par 1.232 -3.148 -2.027 -1.397 -0.937 -0.56 -0.239 0.051 0.298 0.571 0.942
##       b11   b12  b13
## par 1.019 1.487 1.49
## 
## $`2853`
##         a    b1     b2
## par 0.972 -2.05 -0.161
## 
## $`2880`
##         a     b     g u
## par 2.099 0.376 0.046 1
## 
## $`2881`
##         a     b     g u
## par 1.776 2.742 0.001 1
## 
## $`2882`
##         a     b     g u
## par 1.861 0.002 0.001 1
## 
## $`2883`
##         a     b     g u
## par 1.795 2.267 0.176 1
## 
## $`2884`
##         a     b     g u
## par 0.307 1.735 0.019 1
## 
## $`3072`
##         a      b     g u
## par 0.176 -2.837 0.059 1
## 
## $`3075`
##         a      b     g u
## par 0.824 -1.985 0.025 1
## 
## $`3076`
##         a     b     g u
## par 0.882 2.485 0.001 1
## 
## $`3102`
##         a     b1     b2     b3     b4     b5     b6    b7     b8    b9
## par 2.473 -2.038 -1.325 -1.005 -0.828 -0.692 -0.523 -0.35 -0.145 0.201
## 
## $`3107`
##         a     b     g u
## par 2.954 1.686 0.043 1
## 
## $`3158`
##         a     b     g u
## par 1.194 1.043 0.296 1
## 
## $`3456`
##         a     b     g u
## par 1.384 1.493 0.207 1
## 
## $`3482`
##        a     b1     b2
## par 1.34 -2.338 -1.779
## 
## $`3484`
##        a     b1     b2     b3    b4
## par 0.86 -2.369 -1.275 -0.522 0.246
## 
## $`3485`
##         a     b     g u
## par 2.753 -0.24 0.001 1
## 
## $`3486`
##         a      b     g u
## par 2.807 -0.287 0.001 1
## 
## $`3552`
##         a    b1   b2
## par 1.785 -1.22 -0.6
## 
## $`3591`
##         a     b     g u
## par 0.333 5.635 0.003 1
## 
## $`5557`
##         a     b1    b2     b3     b4     b5     b6     b7    b8
## par 2.579 -1.958 -1.38 -0.988 -0.793 -0.564 -0.173 -0.001 0.223
## 
## $GroupPars
##     MEAN_1 COV_11
## par      0      1

The last problem to solve is that the coefficients are returned as a list. This is because not every item has the same set of parameters. Specifically, items have different numbers of \(b\) parameters, depending on how many score categories are present. However, with some tidyverse magic, we can create a data frame that has one row per item, with all of the \(b\) parameters nested together.

item_params <- coef(model_3pl, IRTpars = TRUE) %>%
  list_modify(GroupPars = zap()) %>%
  imap_dfr(function(x, y) {
    as_tibble(x) %>%
      add_column(itemid = y, .before = 1) %>%
      nest(b = starts_with("b")) %>%
      mutate(b = map(b, function(z) {
        z %>%
          pivot_longer(cols = everything(), names_to = "param",
                       values_to = "value") %>%
          deframe()
      })) %>%
      select(itemid, a, b, everything())
  }) %>%
  rename(c = g) %>%
  replace_na(list(c = 0, u = 1))

item_params
## # A tibble: 40 x 5
##    itemid       a b                c     u
##    <chr>    <dbl> <list>       <dbl> <dbl>
##  1 2487   0.0227  <dbl [1]> 0.119        1
##  2 2488   1.30    <dbl [1]> 0.000212     1
##  3 2489   2.41    <dbl [1]> 0.109        1
##  4 2490   0.906   <dbl [3]> 0            1
##  5 2491   0.503   <dbl [1]> 0.0146       1
##  6 2788   0.240   <dbl [6]> 0            1
##  7 2790   1.52    <dbl [2]> 0            1
##  8 2792   0.00579 <dbl [1]> 0.106        1
##  9 2794   0.397   <dbl [1]> 0.00147      1
## 10 2797   0.899   <dbl [1]> 0.00341      1
## # … with 30 more rows

Person Parameters

We are also likely interested in the person parameters, or the respondent ability estimates. We can extract these using the fscores() function. We then do some formatting to get the scores into a nice data frame, and add a studentid column so we can keep track of which ability estimate goes with which respondent.

person_params <- fscores(model_3pl) %>%
  as_tibble(.name_repair = ~"theta") %>%
  rowid_to_column(var = "studentid")

person_params
## # A tibble: 5,000 x 2
##    studentid  theta
##        <int>  <dbl>
##  1         1 -0.149
##  2         2  0.575
##  3         3  1.30 
##  4         4  0.490
##  5         5  0.900
##  6         6 -0.617
##  7         7 -0.246
##  8         8 -0.897
##  9         9  0.933
## 10        10  0.474
## # … with 4,990 more rows

Test Characteristics

To explore the assessment further, we will examine the item characteristic curves. For dichotomous items, these plots display the probability of a correct response across the range of ability. For polytomous items, they show the probability of scoring in each category across the range of ability. To create the plots, we define the range of ability we are interested in, and then calculate the probabilities at each ability level using the custom icc_calc() function.
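The icc_calc() function is a custom helper that is not defined in this document. As a rough, hypothetical sketch, for a dichotomous 3PL item it might look something like the following (the real version would also need to handle graded items and the nested b vectors):

```r
# Hypothetical sketch of icc_calc() for a dichotomous 3PL item only.
# The `...` absorbs any extra columns passed along by pmap()-style calls.
icc_calc <- function(theta, itemid, a, b, c, u = 1, ...) {
  data.frame(
    itemid = itemid,
    theta  = theta,
    # 3PL: lower asymptote plus the scaled logistic curve
    prob   = c + (u - c) * plogis(a * (theta - b))
  )
}

# Using the estimated parameters for item 2489 from above
icc_calc(theta = 0, itemid = "2489", a = 2.407, b = 0.692, c = 0.109)
```

For a polytomous item, the helper would instead return one probability per score category, built from the GRM cumulative probabilities.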

iccs <- crossing(theta = seq(-3, 3, by = 0.01),
                 itemid = colnames(model_data)) %>%
  left_join(item_params, by = "itemid") %>%
  future_pmap_dfr(icc_calc, .progress = TRUE) %>%
  nest(icc_data = -c(itemid))

In addition to item characteristics, we can also look at test-level characteristics. It is common to examine the distributions of raw scores and the estimated abilities. First, we calculate the total score for each student, and then join the estimated ability parameters (\(\theta\)). Then we can create a plot to compare the distributions using ggplot2 (Wickham, 2016; Wickham et al., 2020), shown in Figure 1.

test_dist <- clean_ied %>%
  pivot_longer(cols = -c(studentid, gender),
               names_to = "item", values_to = "score") %>%
  group_by(studentid, gender) %>%
  summarize(`Raw~Score` = sum(score, na.rm = TRUE)) %>%
  left_join(person_params, by = "studentid") %>%
  pivot_longer(cols = -c(studentid, gender),
               names_to = "measure", values_to = "score")

ggplot(test_dist, aes(x = score)) +
  geom_histogram(bins = 20, color = "black", alpha = 0.8) +
  facet_wrap(~measure, scales = "free", labeller = label_parsed) +
  labs(x = expression(theta), y = "Respondents") +
  theme_bw()
Figure 1: Distributions of assessment scores.

Another important aspect of an assessment scaled with IRT is the test information function, which is the basis for the conditional standard error of measurement. The item information can be calculated using the custom info_calc() function. We then aggregate across all items to get the test information, and calculate the associated standard error. Finally, we can plot both functions, as shown in Figure 2.
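For a 3PL item, the information function is

\[
I_j(\theta) = a_j^2\,\frac{Q_j(\theta)}{P_j(\theta)}\left[\frac{P_j(\theta) - c_j}{1 - c_j}\right]^2,
\]

where \(P_j(\theta)\) is the probability of a correct response and \(Q_j(\theta) = 1 - P_j(\theta)\). The test information is the sum across items, \(T(\theta) = \sum_j I_j(\theta)\), and the conditional standard error of measurement is \(SE(\theta) = 1/\sqrt{T(\theta)}\), which is exactly the calculation in the code below.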

info <- crossing(theta = seq(-3, 3, by = 0.01),
                      itemid = colnames(model_data)) %>%
  left_join(item_params, by = "itemid") %>%
  future_pmap_dfr(info_calc, .progress = TRUE)

test_info <- info %>%
  group_by(theta) %>%
  summarize(Information = sum(info)) %>%
  mutate(`Standard Error` = 1 / sqrt(Information)) %>%
  pivot_longer(cols = -theta, names_to = "measure", values_to = "value")

ggplot(test_info, aes(x = theta, y = value)) +
  geom_line(size = 2) +
  scale_x_continuous(breaks = seq(-3, 3, by = 1)) +
  expand_limits(y = 0) +
  labs(x = expression(theta), y = "Value") +
  facet_wrap(~measure, scales = "free") +
  theme_bw()
Figure 2: Test information function and associated conditional standard error of measurement.

Model Fit

For any statistical analysis, it is important to assess how well the estimated model represents, or fits, the observed data. In general, there are two types of model fit: absolute and relative fit. Absolute fit directly assesses how well the estimated parameters reflect the data. This can be assessed at the item and model level. Relative fit is used to compare competing models to determine which should be preferred.

Item-Level Absolute Fit

For assessing item-level fit, we can calculate standardized residuals (known as the \(Q1\) statistic; Yen, 1981). This is done by splitting respondents into ability groups (10 in this example), and then comparing the observed performance of respondents in each group to the model-expected performance. Using the observed and expected frequencies, we can then calculate a \(\chi^2\) statistic to determine whether the item shows acceptable model fit. A significant \(\chi^2\) test indicates poor fit. All statistics are summarized in Table 1.
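For a dichotomous item \(j\), Yen’s statistic takes the form

\[
Q_{1j} = \sum_{k=1}^{10} \frac{N_k\,(O_{jk} - E_{jk})^2}{E_{jk}(1 - E_{jk})},
\]

where \(N_k\) is the number of respondents in ability group \(k\), \(O_{jk}\) is the observed proportion correct in that group, and \(E_{jk}\) is the model-expected proportion. The statistic is then referred to a \(\chi^2\) distribution.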

person_groups <- person_params %>%
  mutate(group = cut_interval(theta, n = 10),
         quad = as.numeric(group))

quad_summary <- clean_ied %>%
  pivot_longer(cols = -c(studentid, gender),
               names_to = "item", values_to = "score") %>%
  filter(!is.na(score)) %>%
  left_join(select(person_groups, studentid, theta, quad), by = "studentid") %>%
  group_by(quad, item) %>%
  summarize(n = n(), total_score = sum(score), total_theta = sum(theta)) %>%
  ungroup() %>%
  mutate(prop = case_when(n < 5 ~ NA_real_,
                          TRUE ~ total_score / n),
         theta = case_when(n < 5 ~ NA_real_,
                           TRUE ~ total_theta / n))

srs <- quad_summary %>%
  select(quad, n, prop, theta, itemid = item) %>%
  left_join(item_params, by = "itemid") %>%
  pmap_dfr(sr_calc) %>%
  nest(sr_data = -c(itemid)) %>%
  left_join(iccs, by = "itemid")
srs %>%
  select(-icc_data) %>%
  unnest(sr_data) %>%
  group_by(itemid) %>%
  summarize(n = sum(n), chisq = sum(chisq),  df = sum(df)) %>%
  mutate(n = prettyNum(n, big.mark = ","),
         pvalue = pchisq(chisq, df, lower.tail = FALSE),
         chisq = sprintf("%0.1f", chisq),
         print_pvalue = sprintf("%0.3f", pvalue),
         print_pvalue = case_when(print_pvalue == "0.000" ~ "<0.001",
                                  TRUE ~ paste0(print_pvalue))) %>%
  mutate(print_pvalue = cell_spec(print_pvalue, "html", color = ifelse(pvalue < .05, "black", "grey"),
                                  background = ifelse(pvalue < .05, "#E14646", "#FFFFFF"))) %>%
  select(-pvalue) %>%
  kable(align = c("c", rep("r", 4)), booktabs = TRUE,
        col.names = c("Item", "$\\pmb{n}$", "$\\pmb{\\chi^2}$", "df", "$\\pmb{p}$-value"),
        caption = "Item-Level Standardized Residuals", escape = FALSE) %>%
  kable_styling() %>%
  row_spec(0, bold = TRUE)
Table 1: Item-Level Standardized Residuals
Item \(\pmb{n}\) \(\pmb{\chi^2}\) df \(\pmb{p}\)-value
2487 4,771 26.9 10 0.003
2488 4,723 12.5 10 0.252
2489 4,759 17.5 10 0.064
2490 4,760 24.5 10 0.006
2491 4,761 9.4 10 0.492
2788 4,908 527.9 10 <0.001
2790 4,908 20.1 10 0.028
2792 4,908 7.5 10 0.680
2794 4,900 14.2 10 0.166
2797 4,881 22.4 10 0.013
2829 4,999 3.1 10 0.978
2830 4,953 12.6 10 0.248
2831 4,938 17.7 10 0.060
2832 4,970 4.8 10 0.902
2841 4,784 12.5 10 0.256
2843 4,783 8.2 10 0.607
2844 4,774 110.6 10 <0.001
2846 4,901 8.3 10 0.599
2847 4,890 9.9 10 0.454
2851 4,828 575.6 10 <0.001
2853 4,852 24.7 10 0.006
2880 4,831 11.1 10 0.352
2881 4,781 13.7 10 0.185
2882 4,773 10.9 10 0.364
2883 4,805 9.6 10 0.475
2884 4,812 5.4 10 0.860
3072 4,969 21.0 10 0.021
3075 4,947 13.3 10 0.207
3076 4,945 3.8 10 0.955
3102 4,730 680.2 10 <0.001
3107 4,770 49.4 10 <0.001
3158 4,781 9.5 10 0.484
3456 4,786 9.5 10 0.481
3482 4,964 46.6 10 <0.001
3484 4,919 204.4 10 <0.001
3485 4,750 7.7 10 0.654
3486 4,725 8.7 10 0.563
3552 4,966 54.8 10 <0.001
3591 4,896 23.1 10 0.010
5557 2,974 242.5 10 <0.001

Model-Level Absolute Fit

At the model level, we also assess fit using residuals. For every respondent and item, we can calculate the model expectation and then compare it to the observed score. As with any model, the difference between the two is the residual. In this example, we’ll calculate the residuals using the custom expected_calc() function.
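Like icc_calc(), expected_calc() is a custom helper not defined in this document. As a hypothetical sketch for a dichotomous 3PL item, the expected score is simply the response probability evaluated at the respondent’s \(\hat\theta\):

```r
# Hypothetical sketch of expected_calc() for a dichotomous 3PL item.
# Extra columns from the joined data are absorbed by `...`.
expected_calc <- function(studentid, gender, itemid, obs_score,
                          theta, a, b, c, u = 1, ...) {
  # Expected score = model-implied probability of a correct response
  exp_score <- c + (u - c) * plogis(a * (theta - b))
  data.frame(studentid, gender, itemid, obs_score,
             exp_score, residual = obs_score - exp_score)
}

expected_calc(studentid = 1, gender = "F", itemid = "2489",
              obs_score = 1, theta = 0.5, a = 2.407, b = 0.692, c = 0.109)
```

For a graded item, the expected score would instead be the probability-weighted sum over score categories.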

residuals <- clean_ied %>%
  pivot_longer(cols = -c(studentid, gender),
               names_to = "itemid", values_to = "obs_score") %>%
  filter(!is.na(obs_score)) %>%
  left_join(person_params, by = "studentid") %>%
  left_join(item_params, by = "itemid") %>%
  future_pmap_dfr(expected_calc, .progress = TRUE) %>%
  select(studentid, gender, itemid, obs_score, exp_score, residual)

From these residuals we can calculate the \(Q3\) statistic (Yen, 1984). One assumption of IRT models is that the residuals are random and uncorrelated, normally distributed, and close to zero on average. If the residuals of a pair of items are correlated, this may indicate additional dimensionality not captured by the model, called local item dependence (LID). The \(Q3\) statistic is the correlation between the residuals of a pair of items. Thus, for this 40-item test there are 780 \(Q3\) statistics (the number of elements in the lower triangle of the correlation matrix). As a general rule, \(Q3\) values larger than \(\pm 0.2\) are considered serious violations. Positive values indicate that items share additional common dimensionality, and negative values indicate extra dimensionality not shared by those particular items. Using our residuals data, we can calculate the residual correlations as shown below.
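In symbols, with residual \(d_{ij} = x_{ij} - E[x_{ij} \mid \hat\theta_i]\) for respondent \(i\) on item \(j\), the statistic for an item pair \((j, k)\) is simply the Pearson correlation between the two items’ residual vectors,

\[
Q_{3,jk} = r_{d_j d_k}.
\]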

corrs <- residuals %>%
  distinct() %>%
  select(studentid, itemid, residual) %>%
  pivot_wider(names_from = itemid, values_from = residual) %>%
  select(-studentid) %>%
  cor(use = "pairwise.complete.obs")

corrs <- corrs %>%
  as_tibble(rownames = "item1") %>%
  pivot_longer(cols = -item1, names_to = "item2", values_to = "cor") %>%
  mutate(lower_tri = as.vector(lower.tri(corrs)))  %>%
  filter(lower_tri) 

q3 <- pull(corrs, cor)

The distribution of the \(Q3\) statistics for this assessment is shown in Figure 3. Overall, the distribution has a mean of -0.02 and a standard deviation of 0.05, with 7 correlations outside the \(\pm 0.2\) range. Thus, there is a strong indication of additional dimensionality not captured by the model. The correlations outside of \(\pm 0.2\) are shown in Table 2.

ggplot() +
  geom_histogram(aes(x = q3, y = stat(density)), boundary = 0, binwidth = 0.02,
                 color = "black",  alpha = 0.8) +
  scale_x_continuous(breaks = seq(-0.5, 0.5, 0.1)) +
  labs(x = "*Q3*", y = "Density") +
  theme_bw() +
  theme(axis.title.x = element_markdown())
Figure 3: Distribution of local item dependence \(Q3\) statistics.

corrs %>%
  filter(!between(cor, -0.2, 0.2)) %>%
  select(-lower_tri) %>%
  mutate(cor = sprintf("%0.2f", cor)) %>%
  kable(align = c("c", "c", "r"), booktabs = TRUE,
        col.names = c("Item 1", "Item 2", "Q3"),
        caption = "Flagged \\(Q3\\) Statistics") %>%
  kable_styling() %>%
  row_spec(0, bold = TRUE)
Table 2: Flagged \(Q3\) Statistics
Item 1 Item 2 Q3
2830 3102 -0.22
2830 3552 0.33
2830 5557 -0.22
2841 3158 0.29
3102 3552 -0.25
3102 5557 0.49
3552 5557 -0.23

We can also examine the distribution of standardized residuals that we calculated earlier. Because the residuals have been standardized, we expect this distribution to follow a standard normal distribution, \(\mathcal{N}(\mu=0,\ \sigma= 1)\). In Figure 4, the distribution of the residuals is shown in black and the standard normal distribution in blue. The residuals have a mean of -0.12 and a standard deviation of 2.69. This is reflected in Figure 4, where we see the observed distribution has wider variability than the expected distribution, indicating that misfit is present in our model.

all_sr <- srs %>%
  select(-icc_data) %>%
  unnest(sr_data) %>%
  pull(sr)

ggplot(mapping = aes(x = all_sr)) +
  geom_histogram(aes(y = stat(density)), bins = 30, color = "black",
                 alpha =  0.3) +
  geom_density(color = "black", fill = "black", size = 2, alpha =  0.5) +
  stat_function(fun = dnorm, n = 500, args = list(mean =  0, sd = 1),
                color = "#56B4E9", size = 2, alpha = 0.8) +
  labs(x = "Standardized Residuals", y = "Density") +
  theme_bw()
Figure 4: Distribution of standardized residuals.

Relative Fit

Model comparisons are used to evaluate the relative fit of two or more models. That is, given multiple models, which one should we prefer? Because of the likely multidimensionality we observed in our original unidimensional model, we’ll estimate a multidimensional model to compare against. In the multidimensional model, we’ll still allow all items to measure a general factor, now called G. We’ll also add four additional factors, defined by the contentcode{#} variables in our item metadata, meta.

mirt_spec <- glue("G = 1-{ncol(model_data)}
                   C1 = 2-5,10,15-18,20-21,29,32-34
                   C3 = 2,4-7,9,13,18,20-21,27-29,35,39
                   C4 = 3,12-14,22-26,31,38
                   C5 = 15-17,30-33,36-37,40")

model_mirt <- mirt(data = model_data, model = mirt_spec,
                  itemtype = clean_meta$mirt_type, method = "QMCEM",
                  technical = list(set.seed = 9416))

The mirt package offers several methods for comparing models. In addition to the omnibus likelihood ratio test, there are also five fit indices based on information criteria: Akaike information criterion (AIC; Akaike, 1974), sample size adjusted AIC (AICc; Hurvich & Tsai, 1989), Bayesian information criterion (BIC; Schwarz, 1978), sample size adjusted BIC (SABIC; Sclove, 1987), and the Hannan-Quinn criterion (HQ; Hannan & Quinn, 1979). These are accessed by using the anova() function.
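For a model with log-likelihood \(\log L\), \(p\) estimated parameters, and \(n\) respondents, the two most common indices are

\[
AIC = -2\log L + 2p, \qquad BIC = -2\log L + p\,\log(n).
\]

As a check, the unidimensional model above reported \(\log L = -128483.3\) with \(p = 154\) and \(n = 5{,}000\), giving \(AIC = 256966.6 + 308 \approx 257274.7\) and \(BIC = 256966.6 + 154 \times 8.517 \approx 258278.3\), matching the summary output (up to rounding).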

model_compare <- anova(model_3pl, model_mirt) %>%
  as_tibble() %>%
  mutate(Model = c("UIRT", "MIRT"),
         p = sprintf("%0.3f", p),
         p = case_when(p == "0.000" ~ "< 0.001",
                       TRUE ~ p)) %>%
  mutate_if(is.double, ~sprintf("%0.0f", .x))
## 
## Model 1: mirt(data = model_data, model = glue("F1 = 1-{ncol(model_data)}"), 
##     itemtype = clean_meta$mirt_type, technical = list(set.seed = 9416))
## Model 2: mirt(data = model_data, model = mirt_spec, itemtype = clean_meta$mirt_type, 
##     method = "QMCEM", technical = list(set.seed = 9416))

The information-based fit indices for the model comparison are shown in Table 3. For the information criteria, a lower value indicates better fit. All indices are lower for the multidimensional model, indicating that it is preferred. The \(\chi^2\) test is also significant (\(\chi^2_{(51)} = 2517\), \(p < 0.001\)). Here the 51 degrees of freedom represent the 51 parameters that are estimated in the multidimensional model but not in the unidimensional model, and the significant p-value indicates that constraining those parameters (i.e., fitting the unidimensional model) has a negative impact on model fit. In sum, all model comparisons indicate the multidimensional model provides a better representation of the data than the unidimensional model, even after accounting for model complexity. However, this does not tell us whether the multidimensional model has adequate fit to the data. Model comparisons are most effective when multiple models show adequate absolute fit to the data; relative fit indices can then be used to find the preferred model among those that fit.

model_compare %>%
  select(Model, AIC, AICc, BIC, SABIC, HQ, Loglikelihood = logLik) %>%
  kable(align = "c", booktabs = TRUE,
        caption = "Relative Fit Indices") %>%
  kable_styling() %>%
  row_spec(0, bold = TRUE) %>%
  footnote(general = "UIRT = Unidimensional IRT; MIRT = Multidimensional IRT",
           footnote_as_chunk = TRUE, general_title = "Note.")
Table 3: Relative Fit Indices
Model AIC AICc BIC SABIC HQ Loglikelihood
UIRT 257275 257285 258278 257789 257626 -128483
MIRT 254860 254877 256196 255544 255328 -127225
Note. UIRT = Unidimensional IRT; MIRT = Multidimensional IRT

Other Methods for IRT in R

The mirt package is not the only way to estimate IRT models in R. Choi & Asilkalkan (2019) provide a thorough overview of 45 R packages that can be used to conduct IRT analyses, including differential item functioning and equating. The paper also discusses the features available in each package, making it an excellent resource when trying to find a package for a specific analysis.

For more advanced work and better methods for model fit, Bayesian modeling offers much more flexibility. The Stan probabilistic programming language (Carpenter et al., 2017) offers one way to define these models in R with the rstan package (Guo, Gabry, & Goodrich, 2019). The brms package also offers an interface to Stan for estimating linear and non-linear multilevel models, without having to learn the Stan language (Bürkner, 2017, 2018, 2020). Bürkner (2019) provides a comprehensive overview of how to specify, estimate, and evaluate IRT models, along with comparisons to other R packages.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723. https://doi.org/10.1109/TAC.1974.1100705

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Reading, MA: Addison-Wesley.

Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01

Bürkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10(1), 395–411. https://doi.org/10.32614/RJ-2018-017

Bürkner, P.-C. (2019). Bayesian item response modeling in R with brms and Stan. Retrieved from http://arxiv.org/abs/1905.09501v3

Bürkner, P.-C. (2020). brms: Bayesian regression models using ’stan’. Retrieved from https://CRAN.R-project.org/package=brms

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., … Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76. https://doi.org/10.18637/jss.v076.i01

Chalmers, P. (2019). mirt: Multidimensional item response theory. Retrieved from https://CRAN.R-project.org/package=mirt

Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06

Choi, Y.-J., & Asilkalkan, A. (2019). R packages for item response theory analysis: Descriptions and features. Measurement: Interdisciplinary Research and Perspectives, 17, 168–175. https://doi.org/10.1080/15366367.2019.1586404

Guo, J., Gabry, J., & Goodrich, B. (2019). RStan: R interface to Stan. Retrieved from https://CRAN.R-project.org/package=rstan

Hannan, E. J., & Quinn, B. G. (1979). The determination of the order of an autoregression. Journal of the Royal Statistical Society: Series B (Methodological), 41, 190–195. https://doi.org/10.1111/j.2517-6161.1979.tb01072.x

Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307. https://doi.org/10.1093/biomet/76.2.297

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph No. 17). Richmond, VA: Psychometric Society.

Samejima, F. (1972). A general model for free-response data (Psychometrika Monograph No. 18). Richmond, VA: Psychometric Society.

Samejima, F. (1997). Graded response model. In W. J. Van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 85–100). New York, NY: Springer.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464. https://doi.org/10.1214/aos/1176344136

Sclove, S. L. (1987). Application of model-selection criteria to some problems in multivariate analysis. Psychometrika, 52, 333–343. https://doi.org/10.1007/BF02294360

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. Retrieved from https://ggplot2.tidyverse.org

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., … Dunnington, D. (2020). ggplot2: Create elegant data visualisations using the grammar of graphics.

Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245–262. https://doi.org/10.1177/014662168100500212

Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125–145. https://doi.org/10.1177/014662168400800201